Practical Synthetic Data Generation by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff
Author: Khaled El Emam, Lucy Mosquera, Richard Hoptroff
Language: eng
Format: mobi, pdf, pdf
Publisher: O'Reilly Media, Inc.
Published: 2020-05-19T00:00:00+00:00
Chapter 5. Methods for Synthesizing Data
After describing some basic methods for distribution fitting in the last chapter, we will now use these concepts to generate synthetic data. We will start off with some basic approaches and build up to some more complex ones as the chapter progresses. We will refer to more advanced techniques later on that are beyond the scope of an introductory text, but what we cover should give you a good introduction.
Generating Synthetic Data from Theory
Let's consider the situation where the analyst does not have any real data to start off with, but has some understanding of the phenomenon that they want to model and generate data for. For example, let's say that we want to generate data reflecting the relationship between height and weight. It is generally known that height and weight are positively associated.
According to the Centers for Disease Control, the average height for men in the US is approximately 175 cm,1 and for the sake of our example we will assume a standard deviation of 5 cm. The average weight is 89.7 kg, and we will assume a standard deviation of 10 kg. For the sake of our example, we will model these as normal (Gaussian or bell-shaped) distributions and assume that the correlation between them is 0.5. According to Cohen's guidelines for the interpretation of effect sizes, a correlation of magnitude equal to 0.5 is considered to be large, 0.3 is considered to be medium, and 0.1 is considered to be small. Any correlation above 0.5 would be a strong correlation in practice.2 Therefore, at 0.5 we are assuming a large correlation between height and weight. Based on these specifications, we can create a dataset of 5,000 observations that models this phenomenon.
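As a worked step, the specifications above are enough to write down the mean vector and covariance matrix that a bivariate normal model would use. The symbols \mu, \sigma_h, \sigma_w, \rho, and \Sigma are introduced here for illustration only and do not come from the original text; the off-diagonal term is simply the correlation multiplied by the two standard deviations.

```latex
\mu = \begin{pmatrix} 175 \\ 89.7 \end{pmatrix},
\qquad
\Sigma =
\begin{pmatrix}
\sigma_h^2 & \rho\,\sigma_h\,\sigma_w \\
\rho\,\sigma_h\,\sigma_w & \sigma_w^2
\end{pmatrix}
=
\begin{pmatrix}
5^2 & 0.5 \cdot 5 \cdot 10 \\
0.5 \cdot 5 \cdot 10 & 10^2
\end{pmatrix}
=
\begin{pmatrix}
25 & 25 \\
25 & 100
\end{pmatrix}
```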
We will present three ways to do this: (a) sampling from multivariate (normal) distributions, (b) inducing a correlation during the sampling process, and (c) using copulas. Each will be illustrated below.
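A minimal sketch of approach (a), sampling directly from a bivariate normal distribution, might look like the following. It assumes Python with NumPy, which the text does not specify; the function name `synthesize_height_weight` and the fixed seed are hypothetical choices made for illustration, and this is not necessarily how the authors implement the method.

```python
import numpy as np


def synthesize_height_weight(n=5000, seed=0):
    """Draw n (height_cm, weight_kg) pairs from a bivariate normal.

    Parameter values follow the specification in the text; the function
    name and the seed are illustrative assumptions.
    """
    mean = np.array([175.0, 89.7])   # mean height (cm), mean weight (kg)
    sd = np.array([5.0, 10.0])       # assumed standard deviations
    rho = 0.5                        # assumed height-weight correlation

    # Build the covariance matrix from the standard deviations and correlation.
    corr = np.array([[1.0, rho],
                     [rho, 1.0]])
    cov = np.outer(sd, sd) * corr

    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n)


data = synthesize_height_weight()
height, weight = data[:, 0], data[:, 1]

# Compare the sample statistics against the specification.
print("mean height:", height.mean())
print("mean weight:", weight.mean())
print("correlation:", np.corrcoef(height, weight)[0, 1])
```

With 5,000 observations the sample means and correlation should land close to, but not exactly on, the specified values, because each run is a random draw from the model.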
